* =============================================================================
* File Name: 2016_EC_EnvRegulation--CWSDataCleaning_useASMCMA.do

* File Description: This file cleans the variables to be used in the CWS analysis. 
* It also creates sample weights used in the analysis. 
* This file is for the dataset created with the ASM CMA definitions.  

* Notes: The data called by this file (asm_npri_2002to2012_withAQ.dta) 
* was created using this do file (2016_EC_EnvRegulation--MergeCMAAirQuality_March2019.do). 
* Two variables are created in that do file: "CMA_ID_Merge", and "Year_Merge". 
* The air quality variables ("O3_CMA", "O3_CMA_3YearAve", "PM25_CMA", and 
* "PM25_CMA_3YearAve") define PM 2.5 and ozone concentrations by CMA-Year. These 
* variables originally came from the "CMA_PM25andO3.dta" file. All other 
* variables used in this file are created herein. 

* Thise file creates two dta files:
* 	- npri_asm_envreg_ASMAirQuality_full_OnlyContempEmitters.dta
* 	- npri_asm_envreg_ASMAirQuality_analysis_OnlyContempEmitters.dta

* The first file is the "full" dataset. It includes every plant in the ASM. This is
* used for some of the robustness checks and descriptive statistics.

* The second file is the "analysis" dataset. It drops any plants that weren't matched
* to the NPRI. Most of the analysis is run on this dataset.   

* Creation date: May 12, 2016

* This version: March 6, 2019

* Author: Nouri Najjar
* =============================================================================

* Loop over each pollutant
foreach x in PM25 {

* -----------------------------------------------------------------------------
* Section 1: Call the npri-asm data with air quality from the cder server. 

* This calls the npri-asm data with CMA-year level air quality. It was created by 
* merging CMA-year level air quality into the original npri-asm by matching
* on plants' CMAs as reported in the NPRI (not the ASM). 
* -----------------------------------------------------------------------------

* Set server folder. 
local datadir \\f4cder01\2016_EC_EnvRegulation\DATA\

* Open dataset. Use the NPRI CMA Matched Data. 
use "`datadir'asm_npri_2002to2012`x'_withAQ.dta", clear
* -----------------------------------------------------------------------------


* -----------------------------------------------------------------------------
* Section 2: Create variables needed for the analysis.  

* Creates required identifiers, treatment, fixed effects, and clustering 
* variables. 
* -----------------------------------------------------------------------------
********
* New plant and pollutant-plant identfiers. 
********
* Create duplicate plant-year indicator. Some plants emit multiple pollutants, 
* and so have multiple entries in a given year. This takes a value > 1 for 
* duplicate plant-years. 
bysort s_meadloc yr4: gen dupPlantYear = cond(_N==1,0,_n)

* Create a numeric unique identifier for plants in the npri. 
gen NPRI_ID = real(npri_id) 

* Define plant identifier according to NPRI. "PlantID" is called in the remainder
* of the code to refer to plants. 
local PlantID NPRI_ID

* Duplicate indicator for pollutant-plants. Used to tabulate median emissions by 
* plant.
bysort NPRI_ID cas_number: gen dupPolPlant = cond(_N==1,0,_n)
********


********
* Fixed effect and clustering variables.

* Creates variables used for industry-year and CMA-year fixed effects. Creates
* the variables used for clustering by CMA-industry. 
********
* CMA-Year indicator for fixed effects.
egen CMA_Year = group(CMA_ID_Merge yr4)
********

*******
* Analysis Variables

* Creates the treatment indicators for the CWS standards. Separate treatment 
* variables are created for the PM 2.5 and Ozone standards. 
*******
* Indicator for CMA-Years that exceed the ozone standard (ozone air quality 
* greater than 65 ppb). 
gen O3_AboveThreshold = cond(O3_CMA>=65 & missing(O3_CMA)==0 & REP_PERIOD>=2000,1,0) if _merge==3

* Indicator for CMA-Years that exceed the PM 2.5 standard (PM 2.5 air quality 
* greater than 30 micrograms per cubic meter)
gen PM25_AboveThreshold = cond(PM25_CMA>=30 & missing(PM25_CMA)==0 & REP_PERIOD>=2000,1,0) if _merge==3

* Target Industries
gen TargetInd = 1 if(naics3 == 322 | naics3 == 321 | naics4 == 3311 | naics4 == 3312 | naics4 == 3313 | naics4 == 3314 | naics4 == 3273 | naics4 == 3241)
recode TargetInd (.=0)

* Indicator for plants treated by the ozone standard. Takes a value of one for 
* plants in targeted industries in a CMA-year with air quality above the ozone standard. 
* Define two versions: the version only using plants with valid CMA air quality,
* and a version that includes CMAs that didn't have valid air quality data. All 
* plants in these CMAs without air quality are considered unregulated.  
gen O3Treated = TargetInd*O3_AboveThreshold

* Indicator for plants treated by the PM 2.5 standard. Takes a value of one for 
* plants in targeted industries in a CMA-year with air quality above the PM 2.5
*  standard. Define two versions: the version only using plants with valid CMA 
* air quality, and a version that includes CMAs that didn't have valid air 
* quality data. All plants in these CMAs without air quality are considered unregulated.  
gen PM25Treated = TargetInd*PM25_AboveThreshold
*******

*******
* Other variables used in analysis. 
*******
* Create indicator for CMAs that exceed the PM 2.5 standard at least once. 
* Sum number of observations in CMA treated by PM standard. Next, set 
* indicator to one for all CMAs with more than one treated observation.
bysort CMA_ID_Merge: egen SumPM25_AboveThreshold = total(PM25_AboveThreshold)
gen CMA_AbovePM25_Once = cond(SumPM25_AboveThreshold>0,1,0)
drop SumPM25_AboveThreshold

* Indicator for CMAs that exceed O3 standard once
* First, sum number of observations in CMA treated by O3 standard. Next, set 
* indicator to one for all CMAs with more than one treated observation.
bysort CMA_ID_Merge: egen SumO3_AboveThreshold = total(O3_AboveThreshold)
gen CMA_AboveO3_Once = cond(SumO3_AboveThreshold>0,1,0)
drop SumO3_AboveThreshold
*******

* -----------------------------------------------------------------------------



* -----------------------------------------------------------------------------
* Section 3: Clean pollution and economic variables. 

* This section cleans the existing pollution and economic variables, and creates 
* some extra variables. 
* -----------------------------------------------------------------------------
********
* Clean the pollution variable by converting to tonnes and taking natural logs.
******** 
* Convert total air emissions to tonnes. NOTE: use total_all_ variable
gen total_air_converted = total_all_ if units=="tonnes"
replace total_air_converted = 0.001*total_all_ if units=="kg"
replace total_air_converted = 0.000001*total_all_ if units=="grams"

* Take logs of pollution, setting zero as missing.
gen LN_total_air_converted = ln(total_air_converted)
********


********
* Clean the output variable (vsm)
********
* Create real output. Deflate using the output deflater, po. 
gen real_vsm = vsm*po

* Take logs, leave zeros as missing. 
gen LN_real_vsm = ln(real_vsm)
********

********
* Clean the total employment variable (totalemp)
********
* Take logs, dropping plants with zero employment. 
gen LN_totalemp = ln(totalemp)
********

********
* Clean the value added manufacturing (vam) variable and create labour productivity
* as the ratio of value added to number of workers.
********
* Create real value added manufacturing. Deflate using the value added deflater, pq. 
gen real_vam = vam*pq
gen LN_real_vam = ln(real_vam)

* Compute labour productivity only for plants with positive employment.
gen VAPerWorker = real_vam/totalemp
gen LN_VAPerWorker = ln(VAPerWorker)
********

********
* Clean sales per worker
********
gen LN_SalesPerWorker = ln(real_vsm/totalemp)
********

********
* Clean poduction material spending
********
gen LN_prodmat = ln(prodmat)
********

********
* Clean non production material spending
********
gen NonProductionMatCosts = tmatcost-prodmat
gen LN_NonProductionMatCosts = ln(NonProductionMatCosts)
********

********
* Clean total energy spending
********
gen LN_heatpr = ln(heatpr)
********

********
* Clean R&D spending
********
gen LN_RD = ln(c4251)
gen RD_Indicator = cond(c4251>0 & missing(c4251)==0,1,0)
********

********
* Create pollution intensity variables
********
* Compute emissions per dollar sales
gen PollutionPerSales = total_air_converted/real_vsm
gen LN_PollutionPerSales = ln(PollutionPerSales)

* Compute emissions per dollar VA
gen PollutionPerVA = total_air_converted/real_vam
gen LN_PollutionPerVA = ln(PollutionPerVA)
********

******
* Export variables. 
* Creates revenues from exports, using the fraction of sales variables reported in ASM.
******
* Create fraction of sales in other countries (sum of US, Mexico, EU, Asia, and other countries sales).
egen SalesFrac_Export = rowtotal(c8465 c8470 c8476 c8477 c8478), missing
* One value exceeds 100, set it to 100. 
replace SalesFrac_Export = 100 if SalesFrac_Export >100 & missing(SalesFrac_Export)==0

* Create export revenues. Computed as the fraction of sales from exports, 
* multiplied by total sales. 
gen ExportRevenue = (SalesFrac_Export*real_vsm)/100
gen LN_ExportRevenue = ln(ExportRevenue)

* Create indicator for plant-years that export. Takes a value of one for all 
* plant-years with positive exports.  
gen ExportIndicator = cond(ExportRevenue>0 & missing(ExportRevenue)==0,1,0)



* Create US sales. Computed as the fraction of sales in US multiplied by total sales. 
gen USRevenue = c8465*real_vsm
gen LN_USRevenue = ln(USRevenue)

* Create ROW sales. Computed as the fraction of sales in Non-US and non-Canada
* countries multiplied by total sales.
egen SalesFrac_ROWExport = rowtotal(c8470 c8476 c8477 c8478), missing
gen ROWRevenue = SalesFrac_ROWExport*real_vsm
gen LN_ROWRevenue = ln(ROWRevenue)
drop SalesFrac_ROWExport
******

******
* Domestic sales variables.
* Sales by own province, neighbour, and non-neighbouring provinces.  
******
* Fraction of output sold in Canada. Computed as sum of sales by province. 
egen SalesFrac_Canada = rowtotal(c8400 c8405 c8410 c8415 c8420 c8425 c8430 c8435 c8440 c8445 c8451 c8452 c8455), missing
replace SalesFrac_Canada = 100 if SalesFrac_Canada>100 & missing(SalesFrac_Canada)==0

* Revenue from domestic sales. Computed as fraction of sales in Canada times total sales. 
gen DomesticRevenue = (SalesFrac_Canada*real_vsm)/100
gen LN_DomesticRevenue = ln(DomesticRevenue)

* Indicator for plant that sells domestically. Takes a value of one for all plant-
* years with positive sales in Canada. 
gen DomesticIndicator = 1 if DomesticRevenue>0 & missing(DomesticRevenue)==0
recode DomesticIndicator (.=0) if DomesticRevenue==0


* Create % of sales in own province. Note that province codes are: Newfoundland = 10, 
* PEI = 11, Nova Scotia = 12, New Brunswick = 13, Quebec = 24, Ontario = 35, 
* Manitoba = 46, Saskatchewan = 47, Alberta = 48, British Columbia = 59, Yukon = 60,
* Northwest Territories = 61, and Nunavut = 62. 
gen SalesFrac_OwnProv = c8400 if prov=="10"
replace SalesFrac_OwnProv = c8405 if prov=="12"
replace SalesFrac_OwnProv = c8410 if prov=="13"
replace SalesFrac_OwnProv = c8415 if prov=="11"
replace SalesFrac_OwnProv = c8420 if prov=="24"
replace SalesFrac_OwnProv = c8425 if prov=="35"
replace SalesFrac_OwnProv = c8430 if prov=="46"
replace SalesFrac_OwnProv = c8435 if prov=="47"
replace SalesFrac_OwnProv = c8440 if prov=="48"
replace SalesFrac_OwnProv = c8445 if prov=="59"
replace SalesFrac_OwnProv = c8451 if prov=="61"
replace SalesFrac_OwnProv = c8452 if prov=="62"
replace SalesFrac_OwnProv = c8455 if prov=="60"

* Revenue from sales in own province. Computed as fraction of sales in own province
* times total sales. 
gen OwnProvRevenue = (SalesFrac_OwnProv*real_vsm)/100
gen LN_OwnProvRevenue = ln(OwnProvRevenue)

* Indicator for plant that sells in own province. Takes a value of one for all plant-
* years with positive sales in own province. 
gen OwnProvIndicator = 1 if SalesFrac_OwnProv>0 & missing(SalesFrac_OwnProv)==0
recode OwnProvIndicator (.=0) if SalesFrac_OwnProv==0

* Fraction of output sold in other provinces outside own. Computed as the sum of 
* all sales in Canada and outside one's own province.
gen SalesFrac_OtherProv = SalesFrac_Canada-SalesFrac_OwnProv

* Revenue from sales to other provinces. Computed as the fraction of sales to other
* provinces times total sales. 
gen OtherProvRevenue = (SalesFrac_OtherProv*real_vsm)/100
gen LN_OtherProvRevenue = ln(OtherProvRevenue)

* Other province indicator
gen OtherProvIndicator = cond(OtherProvRevenue>0 & missing(OtherProvRevenue)==0,1,0)


* Fraction of sales sold in neighbour provinces (excluding territories. Neighbour 
* provinces are those that share a border. 
gen SalesFrac_NeighbourProv = c8440 if prov=="59"
egen SalesFrac_NeighbourProvAB = rowtotal(c8445 c8435) if prov=="48"
replace SalesFrac_NeighbourProv = SalesFrac_NeighbourProvAB if prov=="48"
egen SalesFrac_NeighbourProvSK = rowtotal(c8440 c8430) if prov=="47"
replace SalesFrac_NeighbourProv = SalesFrac_NeighbourProvSK if prov=="47"
egen SalesFrac_NeighbourProvMN = rowtotal(c8435 c8425) if prov=="46"
replace SalesFrac_NeighbourProv = SalesFrac_NeighbourProvMN if prov=="46"
egen SalesFrac_NeighbourProvON = rowtotal(c8430 c8420) if prov=="35"
replace SalesFrac_NeighbourProv = SalesFrac_NeighbourProvON if prov=="35"
egen SalesFrac_Atlantic = rowtotal(c8400 c8405 c8410 c8415)
egen SalesFrac_NeighbourProvQC = rowtotal(c8425 SalesFrac_Atlantic) if prov=="24"
replace SalesFrac_NeighbourProv = SalesFrac_NeighbourProvQC if prov=="24"
gen SalesFrac_NeighbourAtlanticND = SalesFrac_Atlantic-c8400
egen SalesFrac_NeighbourProvND = rowtotal(c8420 SalesFrac_NeighbourAtlanticND) if prov=="10"
replace SalesFrac_NeighbourProv = SalesFrac_NeighbourProvND if prov=="10"
gen SalesFrac_NeighbourAtlanticNB = SalesFrac_Atlantic-c8410
egen SalesFrac_NeighbourProvNB = rowtotal(c8420 SalesFrac_NeighbourAtlanticNB) if prov=="13"
replace SalesFrac_NeighbourProv = SalesFrac_NeighbourProvNB if prov=="13"
gen SalesFrac_NeighbourAtlanticNS = SalesFrac_Atlantic-c8405
egen SalesFrac_NeighbourProvNS = rowtotal(c8420 SalesFrac_NeighbourAtlanticNS) if prov=="12"
replace SalesFrac_NeighbourProv = SalesFrac_NeighbourProvNS if prov=="12"
gen SalesFrac_NeighbourAtlanticPEI = SalesFrac_Atlantic-c8415
egen SalesFrac_NeighbourProvPEI = rowtotal(c8420 SalesFrac_NeighbourAtlanticPEI) if prov=="11"
replace SalesFrac_NeighbourProv = SalesFrac_NeighbourProvPEI if prov=="11"


* Neighbour province revenues. Computed as fraction of sales to neighbours, times
* total sales. 
gen NeighbourProvRevenue = (SalesFrac_NeighbourProv*real_vsm)/100
gen LN_NeighbourProvRevenue = ln(NeighbourProvRevenue)
******
* -----------------------------------------------------------------------------



* -----------------------------------------------------------------------------
* Section 4: create weights for regressions.  

* Plants not-merged with the ASM data causes attenuation bias. This occurs because
* the probability of a succesful match depends on the plant's size, which causes 
* it to be correlated with the effect of treatment (treatment effects are 
* heterogenous). We correct for this by weighting each observation by the inverse 
* probability it was succesfully matched. 

* We compute match probabilities by bins of emissions for each pollutant (PM 2.5, PM 10,
* and NOx). We then compute the match probability by pollution bin for each pollutant. 
* We do this by dividing the number of observations in a bin in the matched data 
* to the number of observations in that bin in the full pollution dataset. The 
* inverse of this match probability is used as the weight for each plant in that 
* bin. 
* -----------------------------------------------------------------------------

********
* Create weights for PM 2.5 regressions using median PM 2.5 emissions. 
********
* Median log emissions for PM 2.5
bysort NPRI_ID: egen Median_PM25 = median(LN_total_air_converted) if cas_number=="NA - M10"


* Put plants into bins by median PM 2.5 emissions. 
gen Bin1_PM25 = cond(Median_PM25<-4 & missing(Median_PM25)==0,1,0)
gen Bin2_PM25 = cond(Median_PM25>=-4 & Median_PM25<-3 & missing(Median_PM25)==0,1,0)
gen Bin3_PM25 = cond(Median_PM25>=-3 & Median_PM25<-2 & missing(Median_PM25)==0,1,0)
gen Bin4_PM25 = cond(Median_PM25>=-2 & Median_PM25<-1 & missing(Median_PM25)==0,1,0)
gen Bin5_PM25 = cond(Median_PM25>=-1 & Median_PM25<0 & missing(Median_PM25)==0,1,0)
gen Bin6_PM25 = cond(Median_PM25>=0 & Median_PM25<1 & missing(Median_PM25)==0,1,0)
gen Bin7_PM25 = cond(Median_PM25>=1 & Median_PM25<2 & missing(Median_PM25)==0,1,0)
gen Bin8_PM25 = cond(Median_PM25>=2 & Median_PM25<3 & missing(Median_PM25)==0,1,0)
gen Bin9_PM25 = cond(Median_PM25>=3 & Median_PM25<4 & missing(Median_PM25)==0,1,0)
gen Bin10_PM25 = cond(Median_PM25>=4 & missing(Median_PM25)==0,1,0)

gen PM25_Bins = 1 if Bin1_PM25==1
replace PM25_Bins = 2 if Bin2_PM25==1
replace PM25_Bins = 3 if Bin3_PM25==1
replace PM25_Bins = 4 if Bin4_PM25==1
replace PM25_Bins = 5 if Bin5_PM25==1
replace PM25_Bins = 6 if Bin6_PM25==1
replace PM25_Bins = 7 if Bin7_PM25==1
replace PM25_Bins = 8 if Bin8_PM25==1
replace PM25_Bins = 9 if Bin9_PM25==1
replace PM25_Bins = 10 if Bin10_PM25==1

* Count number of merged plants by bin. These are used to compute weights. 
tab PM25_Bins if dupPolPlant<=1

* Define weights as inverse of match probability, by median emission bins. The 
* match probability is defined as the number of plants in each bin in the matched
* data divided by the number of plants in that same bin in the original NPRI data.
* The weights are computed in the Weight Computations for NPRI-ASM.xlsx file found
* in "E:\2016_EC_EnvRegulation\Stored Results\Weights".
gen Weights_PM25 = 1.35483871 if PM25_Bins==1
replace Weights_PM25 = 1.202898551 if PM25_Bins==2
replace Weights_PM25 = 1.633333333 if PM25_Bins==3
replace Weights_PM25 = 1.343891403 if PM25_Bins==4
replace Weights_PM25 = 1.297916667 if PM25_Bins==5
replace Weights_PM25 = 1.287390029 if PM25_Bins==6
replace Weights_PM25 = 1.201413428 if PM25_Bins==7
replace Weights_PM25 = 1.118644068 if PM25_Bins==8
replace Weights_PM25 = 1.153153153 if PM25_Bins==9
replace Weights_PM25 = 1.091603053 if PM25_Bins==10

drop Median_PM25 PM25_Bins Bin1_PM25 Bin2_PM25 Bin3_PM25 Bin4_PM25 Bin5_PM25 Bin6_PM25 Bin7_PM25 Bin8_PM25 Bin9_PM25 Bin10_PM25
********

********
* Create weights for NOx regressions using median NOx emissions. 
********
* Plants' median log emissions for NOx. 
bysort NPRI_ID: egen Median_NOx = median(LN_total_air_converted) if cas_number=="11104-93-1"


* Put plants into bins by median NOx emissions. 
gen Bin1_NOx = cond(Median_NOx<-2 & missing(Median_NOx)==0,1,0)
gen Bin2_NOx = cond(Median_NOx>=-2 & Median_NOx<-1 & missing(Median_NOx)==0,1,0)
gen Bin3_NOx = cond(Median_NOx>=-1 & Median_NOx<0 & missing(Median_NOx)==0,1,0)
gen Bin4_NOx = cond(Median_NOx>=0 & Median_NOx<1 & missing(Median_NOx)==0,1,0)
gen Bin5_NOx = cond(Median_NOx>=1 & Median_NOx<2 & missing(Median_NOx)==0,1,0)
gen Bin6_NOx = cond(Median_NOx>=2 & Median_NOx<3 & missing(Median_NOx)==0,1,0)
gen Bin7_NOx = cond(Median_NOx>=3 & Median_NOx<4 & missing(Median_NOx)==0,1,0)
gen Bin8_NOx = cond(Median_NOx>=4 & Median_NOx<5 & missing(Median_NOx)==0,1,0)
gen Bin9_NOx = cond(Median_NOx>=5 & Median_NOx<6 & missing(Median_NOx)==0,1,0)
gen Bin10_NOx = cond(Median_NOx>=6 & missing(Median_NOx)==0,1,0)

gen NOx_Bins = 1 if Bin1_NOx==1
replace NOx_Bins = 2 if Bin2_NOx==1
replace NOx_Bins = 3 if Bin3_NOx==1
replace NOx_Bins = 4 if Bin4_NOx==1
replace NOx_Bins = 5 if Bin5_NOx==1
replace NOx_Bins = 6 if Bin6_NOx==1
replace NOx_Bins = 7 if Bin7_NOx==1
replace NOx_Bins = 8 if Bin8_NOx==1
replace NOx_Bins = 9 if Bin9_NOx==1
replace NOx_Bins = 10 if Bin10_NOx==1

* Count number of merged plants by bin. These are used to compute weights. 
tab NOx_Bins if dupPolPlant<=1

* Define weights as inverse of match probability, by median emission bins. The 
* match probability is defined as the number of plants in each bin in the matched
* data divided by the number of plants in that same bin in the original NPRI data.
* The weights are computed in the Weight Computations for NPRI-ASM.xlsx file found
* in "E:\2016_EC_EnvRegulation\Stored Results\Weights".
gen Weights_NOx = 1.314285714 if NOx_Bins==1
replace Weights_NOx = 1.34375  if NOx_Bins==2
replace Weights_NOx = 1.25 if NOx_Bins==3
replace Weights_NOx = 1.309859155 if NOx_Bins==4
replace Weights_NOx = 1.210526316 if NOx_Bins==5
replace Weights_NOx = 1.151162791 if NOx_Bins==6
replace Weights_NOx = 1.147601476 if NOx_Bins==7
replace Weights_NOx = 1.14516129 if NOx_Bins==8
replace Weights_NOx = 1.109756098 if NOx_Bins==9
replace Weights_NOx = 1.071428571 if NOx_Bins==10

drop Median_NOx NOx_Bins Bin1_NOx Bin2_NOx Bin3_NOx Bin4_NOx Bin5_NOx Bin6_NOx Bin7_NOx Bin8_NOx Bin9_NOx Bin10_NOx
********

********
* -----------------------------------------------------------------------------



* -----------------------------------------------------------------------------
* Section 6: save analysis dataset. 

* Saves two versions. One full version with all observations, and one small
* version only with polluters. 
* -----------------------------------------------------------------------------
* Save dataset with all observations.
save "`datadir'asm_npri_2002to2012`x'_withAQ_cleanedfull.dta", replace

* Drop observations that aren't in the NPRI to save memory space. 
drop if missing(npri_id)==1

* Save analysis dataset with polluters only.
save "`datadir'asm_npri_2002to2012`x'_withAQ_analysis.dta", replace
* -----------------------------------------------------------------------------

}
